Abstract

A fundamental problem in control is to learn a 
model of a system from observations that is use- 
ful for controller synthesis. To provide good per- 
formance guarantees, existing methods must as- 
sume that the real system is in the class of models 
considered during learning. We present an iter- 
ative method with strong guarantees even in the 
agnostic case where the system is not in the class. 
In particular, we show that any no-regret online
learning algorithm can be used to obtain a
near-optimal policy, provided some model achieves
low training error and we have access to a good
exploration distribution. Our approach applies to both
discrete and continuous domains. We demon- 
strate its efficacy and scalability on a challenging 
helicopter domain from the literature. 



1. Introduction 

Model-based reinforcement learning (MBRL) and much of 
control rely on system identification: building a model of a 
system from observations that is useful for controller syn- 
thesis. While often treated as a typical statistical learn- 
ing problem, system identification presents different fun- 
damental challenges as the executed controller and data 
generating process are inextricably intertwined. Naively 
attempting to estimate a controlled system can lead to a 
model that makes small error on a training set, but exhibits 
poor controller performance. This problem arises as the 
policy resulting from controller synthesis is often very dif- 
ferent from the "exploration" policy used to collect data. 
While we might expect the model to make good predictions
at states frequented by the exploration policy, the learned
policy usually induces a different state distribution, where
the model may poorly capture system behavior (Fig. 1).

Figure 1. Example train-test mismatch in a helicopter domain.
Train: model is fit based on samples near the desired trajectory,
e.g. from watching an expert. Test: learned policy ends up in new
regions where model is bad, leading to poor control performance.

This problem is fully appreciated in the system identifica- 
tion literature and has been attacked by considering "open 
loop" identification procedures and "persistent excitation" 
(Ljung, 1999; Abbeel & Ng, 2005) that attempt to suffi- 
ciently "cover" the state-action space. Unfortunately, such 
methods rely on the strong assumption that the true system 
lies in the class of models considered: e.g., for continuous 
systems, they may require the true system to be modeled 
in a class of linear models. With this assumption, they
ensure that the correct model is eventually learned (e.g., by
learning about every discrete state-action pair or all modes
of the linear system), which is what provides their guarantees.

In this work, we provide algorithms for system identifica- 
tion and controller synthesis (i.e. MBRL) that have strong 
performance guarantees with a weaker agnostic assumption 
that the system identification achieves statistically good 
prediction. We adopt a reduction-based analysis (Beygelz- 
imer et al., 2005) that relates the learned policy's perfor- 
mance to prediction error during training. We begin by 
providing agnostic bounds for a simple generic "batch" al- 
gorithm that can represent many learning methods used in 
practice (e.g., building a model from open loop controls,
watching an expert, or running a base policy we want to
improve upon). Due to the mismatch in train/test distributions,
uniform exploration is often the best option with
this approach. Unfortunately, this makes the sample com- 
plexity and performance bounds scale with the size of the 
Markov Decision Process (MDP) (i.e. state/action space). 
Next, we propose a simple iterative approach, closely re- 
lated to online learning, with stronger guarantees that do 
not scale with the size of the MDP when given a good ex- 
ploration distribution. The approach is very simple to im- 
plement and iterates between 1) collecting new data about 
the system by executing a good policy under the current 
model, as well as by sampling from a given exploration 
distribution, and 2) updating the model with that new data. 

This approach is inspired by a recent reduction of imitation 
learning to no-regret online learning (Ross et al., 2011) that
addresses mismatch between train/test distributions. Our 
results can be interpreted as a reduction of MBRL to no- 
regret online learning and optimal control, and show that 
any no-regret algorithm can be used in such a way to learn 
a policy with strong agnostic guarantees. This enables 
MBRL methods to match the strongest existing agnostic 
guarantees of model-free RL methods (Kakade & Lang- 
ford, 2002; Bagnell et al., 2003). 

We first introduce notation and related work. Then we 
present the batch method and our online learning approach 
with their agnostic guarantees (proofs are deferred to the 
supplementary material). Finally we demonstrate the ef- 
ficacy of our approach on a challenging domain from the 
literature: learning to perform acrobatic maneuvers with a 
simulated helicopter (Abbeel & Ng, 2005). 

2. Background and Notation 

We assume the real system behaves according to some unknown
MDP, represented by a set of states $S$ and actions $A$
(both potentially infinite and continuous), a transition
function $T$, where $T_{sa}$ denotes the next state distribution
if we do action $a$ in state $s$, and the initial state
distribution $\mu$ at time 1. We assume the cost function
$C : S \times A \to \mathbb{R}$ is known and seek to minimize
the expected sum of discounted costs over an infinite horizon
with discount $\gamma$.

For any policy $\pi$, let $\pi_s$ be the action distribution
performed by $\pi$ in state $s$; $D^t_{\omega,\pi}$ the state-action
distribution at time $t$ if we started in state distribution
$\omega$ at time 1 and followed $\pi$;
$D_{\omega,\pi} = (1-\gamma) \sum_{t=1}^{\infty} \gamma^{t-1} D^t_{\omega,\pi}$
the state-action distribution over the infinite horizon if we
follow $\pi$, starting in $\omega$ at time 1;
$V^{\pi}(s) = \mathbb{E}_{a \sim \pi_s, s' \sim T_{sa}}[C(s,a) + \gamma V^{\pi}(s')]$
the value function of $\pi$ (the expected sum of discounted
costs of following $\pi$ starting in state $s$);
$Q^{\pi}(s,a) = C(s,a) + \gamma \mathbb{E}_{s' \sim T_{sa}}[V^{\pi}(s')]$
the action-value function of $\pi$ (the expected sum of
discounted costs of following $\pi$ after starting in $s$ and
performing action $a$); and
$J_{\omega}(\pi) = \mathbb{E}_{s \sim \omega}[V^{\pi}(s)] = \frac{1}{1-\gamma} \mathbb{E}_{(s,a) \sim D_{\omega,\pi}}[C(s,a)]$
the expected sum of discounted costs of following $\pi$
starting in $\omega$.
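To make these definitions concrete, the following is a minimal sketch for a small finite MDP, assuming tabular arrays `T`, `C`, `pi` and `omega` (all hypothetical names and random values, not taken from the paper). It computes the value function by solving the linear Bellman equation, computes the discounted state-action distribution, and evaluates the expected total cost both ways, so the two expressions for $J_{\omega}(\pi)$ can be compared numerically.

```python
import numpy as np

def evaluate_policy(T, C, pi, gamma):
    """Return V^pi for a finite MDP.

    T[s, a, s2] : transition probabilities, C[s, a] : costs,
    pi[s, a]    : action probabilities of the policy in each state.
    """
    n_states = T.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, T)      # state-to-state kernel under pi
    c_pi = np.sum(pi * C, axis=1)              # expected one-step cost under pi
    # V = c_pi + gamma * P_pi V  <=>  (I - gamma * P_pi) V = c_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)

def discounted_occupancy(T, pi, omega, gamma):
    """Return D_{omega,pi}(s, a) = (1 - gamma) * sum_t gamma^(t-1) D^t(s, a)."""
    n_states = T.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, T)
    # The state marginal d solves d = (1 - gamma) * omega + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, (1 - gamma) * omega)
    return d[:, None] * pi

# Hypothetical random MDP, used only for illustration.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
T = rng.dirichlet(np.ones(S), size=(S, A))     # shape (S, A, S)
C = rng.uniform(0.0, 1.0, size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)         # random stochastic policy
omega = np.full(S, 1.0 / S)

V = evaluate_policy(T, C, pi, gamma)
D = discounted_occupancy(T, pi, omega, gamma)
J_from_V = omega @ V                           # E_{s ~ omega}[V^pi(s)]
J_from_D = np.sum(D * C) / (1 - gamma)         # E_{(s,a) ~ D}[C(s,a)] / (1 - gamma)
```

The two values `J_from_V` and `J_from_D` should agree up to numerical precision, which is the identity stated at the end of the paragraph above.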

Our goal is to obtain a policy $\pi$ with small regret, i.e.
for any policy $\pi'$, $J_{\mu}(\pi) - J_{\mu}(\pi')$ is small. This is
achieved indirectly by learning a model $\hat{T}$ of the system and
solving for a (near-)optimal policy (under $\hat{T}$); e.g., using
dynamic programming (Puterman, 1994) or approximate
methods (Szepesvari, 2005; Williams, 1992). For continuous
systems, an important special case is linear models with
quadratic cost functions, and potentially additive Gaussian
noise, known as Linear Quadratic Regulators (LQR)^1,
which can be solved exactly and efficiently. Non-linear
systems with non-quadratic cost functions can also be solved
approximately (local optima) using efficient iterative
linearization techniques such as iLQR (Li & Todorov, 2004).
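As an illustration of how an LQR problem can be solved by dynamic programming on the value matrix and the gain, here is a minimal sketch for the discounted, noise-free case. The function name `solve_discounted_lqr`, the double-integrator matrices and the discount value are assumptions chosen for the example, not quantities from the paper; additive Gaussian noise would change the value function only by a constant and leave the gain unchanged.

```python
import numpy as np

def solve_discounted_lqr(A, B, Q, R, gamma, iters=5000, tol=1e-10):
    """Value iteration on V for cost x^T Q x + u^T R u and x' = A x + B u.

    Returns (K, V) with policy u = K x and value function x^T V x.
    """
    def greedy_gain(V):
        # Minimizer of u^T R u + gamma * (A x + B u)^T V (A x + B u) over u.
        G = R + gamma * B.T @ V @ B
        return -gamma * np.linalg.solve(G, B.T @ V @ A)

    V = Q.copy()
    for _ in range(iters):
        K = greedy_gain(V)
        A_cl = A + B @ K                        # closed-loop dynamics
        V_new = Q + K.T @ R @ K + gamma * A_cl.T @ V @ A_cl
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return greedy_gain(V), V

# Hypothetical double integrator with time step 0.1.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])
gamma = 0.99
K, V = solve_discounted_lqr(A, B, Q, R, gamma)
```

For larger problems a dedicated Riccati solver would normally be preferable; the plain iteration is used here only because it mirrors the dynamic programming description directly.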

Related Work: In contrast with "textbook" system iden- 
tification methods, in practice control engineers often pro- 
ceed iteratively to build good models for controller synthe- 
sis. A first batch of data is collected to fit a model and 
obtain a controller, which is then tested in the real system. 
If performance is unsatisfactory, data collection is repeated 
with different sampling distributions to improve the model
where needed, until control performance is satisfactory. By 
doing so, engineers can use feedback of the policies found 
during training to decide how to collect data and improve 
performance. Such methods are commonly used in prac- 
tice and have demonstrated good performance in the work 
of Atkeson & Schaal (1997); Abbeel & Ng (2005). In both 
works, the authors proceed by fitting a first model from 
state transitions observed during expert demonstrations of 
the task, and at following iterations, using the optimal pol- 
icy under the current model to collect more data and fit a 
new model with all data seen so far. Abbeel & Ng (2005) 
show this approach has good guarantees in non-agnostic 
settings (for finite MDPs or LQRs), in that it must find a 
policy that performs as well as the expert providing the initial
demonstrations. Our method can be seen as formalizing this
engineering practice as an algorithm, extending and generalizing
the previous methods of Atkeson & Schaal (1997);
Abbeel & Ng (2005), and suggesting slight modifications
that provide good guarantees even in agnostic settings.

Similarly, the Dataset Aggregation (DAgger) algorithm of
Ross et al. (2011) aggregates data over iterations to obtain
policies that mimic an expert well in imitation learning.
The authors show that such a procedure can be interpreted
as an online learning algorithm (Hazan et al., 2006;
Kakade & Shalev-Shwartz, 2008), more specifically,
Follow-the-(Regularized)-Leader (Hazan et al., 2006), and that
using any no-regret online algorithm ensures good performance.
Our approach can be seen as an extension of DAgger to MBRL settings.

^1 LQR is defined by 4 matrices $A, B, Q, R$ s.t.
$x_{t+1} = A x_t + B u_t + \xi_t$, for $x_t$ and $u_t$ the state and
action at time $t$, where $\xi_t \sim \mathcal{N}(0, \Sigma)$ is (optional)
Gaussian white noise, and the cost is
$C(x, u) = x^\top Q x + u^\top R u$ ($Q \succeq 0$, $R \succ 0$). The optimal
policy is linear ($u = Kx$) and the value function is quadratic
($x^\top V x$). LQR can be solved by dynamic programming on $V$ and $K$.

Our approach leverages the way agnostic model-free RL al- 
gorithms perform exploration. Methods such as Conserva- 
tive Policy Iteration (CPI) (Kakade & Langford, 2002) and 
Policy-Search by Dynamic Programming (PSDP) (Bagnell 
et al., 2003) learn a policy directly by updating policy
parameters iteratively. For exploration, they assume access
to a state exploration distribution $\nu$ that they can restart
the system from and can guarantee finding a policy
performing nearly as well as any policies inducing a state
distribution (over a whole trajectory) close to $\nu$. Similarly,
our approach uses a state-action exploration distribution 
to sample transitions and allows us to guarantee small re- 
gret against any policy with a state-action distribution close 
to this exploration distribution. If the exploration distri- 
bution is close to that of a near-optimal policy, then our 
approach guarantees near-optimal performance, provided a 
good model of data exists. This allows our model-based 
method to match the strongest agnostic guarantees of ex- 
isting model-free methods. Good exploration distributions 
can often be obtained in practice; e.g., from human expert 
demonstrations, domain knowledge, or from a desired trajectory
we would like the system to follow. Additionally, if
we have a base policy we want to improve, it can be used
to generate the exploration distribution, potentially with
additional random exploration in the actions.

3. A Simple Batch Algorithm 

We now describe a simple algorithm, referred to as Batch,
that can be used to analyze many common approaches from
the literature, e.g., learning from a generative model^2, open
loop excitation or by watching an expert (Ljung, 1999).

Let $\mathcal{T}$ denote the class of transition models considered,
and $\nu$ a state-action exploration distribution we can sample
the system from. Batch first executes in the real system $m$
state-action pairs sampled i.i.d. from $\nu$ to obtain $m$ sampled
transitions. Then it finds the best model $\hat{T} \in \mathcal{T}$ of
observed transitions, and solves (potentially approximately)
the optimal control (OC) problem with $\hat{T}$ and known cost
function $C$ to return a policy $\hat{\pi}$ for test execution.
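The Batch procedure described above can be sketched as follows for a finite MDP. This is an illustrative instance only: the tabular maximum-likelihood model, the value-iteration OC solver, the uniform exploration distribution and all function names (`fit_empirical_model`, `value_iteration`, `batch`, `sample_nu`, `step`) are assumptions made for the example rather than the specific choices analyzed in the paper.

```python
import numpy as np

def fit_empirical_model(transitions, n_states, n_actions):
    """Maximum-likelihood tabular model; unvisited pairs default to uniform."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s2 in transitions:
        counts[s, a, s2] += 1
    totals = counts.sum(axis=2, keepdims=True)
    uniform = np.full_like(counts, 1.0 / n_states)
    return np.where(totals > 0, counts / np.maximum(totals, 1), uniform)

def value_iteration(T_hat, C, gamma, iters=500):
    """Greedy (cost-minimizing) deterministic policy under the model T_hat."""
    V = np.zeros(T_hat.shape[0])
    for _ in range(iters):
        Q = C + gamma * T_hat @ V               # Q[s, a]
        V = Q.min(axis=1)
    return Q.argmin(axis=1)

def batch(sample_nu, step, C, gamma, m, rng):
    """Batch: m i.i.d. pairs from nu -> fit a model -> solve the OC problem."""
    n_states, n_actions = C.shape
    data = []
    for _ in range(m):
        s, a = sample_nu(rng)
        data.append((s, a, step(s, a, rng)))    # query the real system
    T_hat = fit_empirical_model(data, n_states, n_actions)
    return value_iteration(T_hat, C, gamma), T_hat

# Hypothetical random MDP standing in for the real system.
rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.9
T_true = rng.dirichlet(np.ones(S), size=(S, A))
C = rng.uniform(size=(S, A))
sample_nu = lambda r: (r.integers(S), r.integers(A))      # uniform exploration
step = lambda s, a, r: r.choice(S, p=T_true[s, a])
policy, T_hat = batch(sample_nu, step, C, gamma, m=2000, rng=rng)
```

Sampling arbitrary state-action pairs as done by `sample_nu` corresponds to having a generative model of the system; with weaker access, only the data collection step would change.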

3.1. Analysis 

Our reduction analysis seeks to answer the following question:
if Batch learns a model $\hat{T}$ with small error on training
data, and solves the OC problem well, what guarantees
does it provide on control performance of $\hat{\pi}$? Our results
illustrate the drawbacks of a purely batch method due to the
mismatch in train-test distribution.

^2 With a generative model, we can set the system to any state
and perform any action to obtain a sample transition.

We measure the quality of the OC problem's solution as follows.
For any policy $\pi'$, let
$\epsilon^{\pi'}_{oc} = \mathbb{E}_{s \sim \mu}[\hat{V}^{\hat{\pi}}(s) - \hat{V}^{\pi'}(s)]$
denote how much better $\pi'$ is compared to $\hat{\pi}$ on model $\hat{T}$
($\hat{V}^{\hat{\pi}}$ and $\hat{V}^{\pi'}$ are the value functions of $\hat{\pi}$ and $\pi'$
under learned model $\hat{T}$ respectively). If $\hat{\pi}$ is an
$\epsilon$-optimal policy on $\hat{T}$ within some class of policies $\Pi$,
then $\epsilon^{\pi'}_{oc} \le \epsilon$ for all $\pi' \in \Pi$. A natural measure of
model error that arises from our analysis is in terms of $L_1$
distance between the predicted and true next state's
distributions. That is, we define
$\epsilon^{L1}_{prd} = \mathbb{E}_{(s,a) \sim \nu}[\|\hat{T}_{sa} - T_{sa}\|_1]$
the predictive error of $\hat{T}$, measured in $L_1$ distance, under
the training distribution $\nu$. However, the $L_1$ distance cannot
be evaluated or optimized from sampled transitions during
training (we observe samples from $T_{sa}$ but not the distribution).
Therefore we also provide our bounds in terms of other
losses we can minimize from samples. This directly relates
control performance to the model's training loss. A
convenient loss is the KL divergence between $T_{sa}$ and $\hat{T}_{sa}$:
$\epsilon^{KL}_{prd} = \mathbb{E}_{(s,a) \sim \nu, s' \sim T_{sa}}[\log(T_{sa}(s')) - \log(\hat{T}_{sa}(s'))]$.
Minimizing KL corresponds to maximizing the log likelihood
of the sampled transitions. This is convenient for common
model classes, such as linear models (as in LQR),
where it amounts to linear regression. For particular cases
where $\mathcal{T}$ is a set of deterministic models and the real system
has finitely many states, the predictive error can be measured
via a classification loss at predicting the next state:
$\epsilon^{cls}_{prd} = \mathbb{E}_{(s,a) \sim \nu, s' \sim T_{sa}}[\ell(\hat{T}, s, a, s')]$,
for $\ell$ the 0-1 loss of whether $\hat{T}$ predicts $s'$ for $(s, a)$, or
any upper bound on the 0-1 loss, e.g., the multi-class hinge
loss if $\mathcal{T}$ is a set of SVMs. In this case, model fitting is a
supervised classification problem and the guarantee is directly
related to the training classification loss. These are related
as follows:



Lemma 3.1. $\epsilon^{L1}_{prd} \le \sqrt{2 \epsilon^{KL}_{prd}}$ and $\epsilon^{L1}_{prd} \le 2 \epsilon^{cls}_{prd}$. The latter
holds with equality if $\ell$ is the 0-1 loss.
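The two relations in Lemma 3.1 can be checked numerically at a single state-action pair. The sketch below draws random discrete distributions (the values and the helper names `l1_distance` and `kl_divergence` are assumptions for illustration). It compares the $L_1$ distance with $\sqrt{2\,\mathrm{KL}}$, which is Pinsker's inequality, and verifies that a deterministic prediction has $L_1$ distance equal to twice its 0-1 loss. Averaging over $(s,a) \sim \nu$ and applying Jensen's inequality then gives the form stated in the lemma.

```python
import numpy as np

def l1_distance(p, q):
    """L1 distance between two discrete distributions."""
    return float(np.abs(p - q).sum())

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    p = rng.dirichlet(np.ones(6))   # stands in for the true T_sa
    q = rng.dirichlet(np.ones(6))   # stands in for the model's prediction
    gaps.append(np.sqrt(2.0 * kl_divergence(p, q)) - l1_distance(p, q))
min_gap = min(gaps)                 # Pinsker's inequality predicts min_gap >= 0

# Deterministic model predicting next state s_hat with probability one.
s_hat = 2
point_mass = np.eye(6)[s_hat]
zero_one_loss = 1.0 - p[s_hat]      # P(s' != s_hat) under the true distribution
l1_deterministic = l1_distance(point_mass, p)   # equals 2 * zero_one_loss
```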

In general, we can use any loss minimizable from samples
that upper bounds $\epsilon^{L1}_{prd}$ for models in the class. Our bounds
are also related to the mismatch between the exploration
distribution $\nu$ and the distribution induced by executing
another policy $\pi$ starting in $\mu$, denoted
$c^{\pi}_{\nu} = \sup_{s,a} \frac{D_{\mu,\pi}(s,a)}{\nu(s,a)}$.
We assume the costs $C(s, a) \in [C_{min}, C_{max}]$ $\forall (s, a)$. Let
$C_{rng} = C_{max} - C_{min}$ and $H = \frac{\gamma C_{rng}}{(1-\gamma)^2}$. $H$ is a scaling
factor that relates model error to error in total cost predictions.

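Theorem 3.1. The policy $\hat{\pi}$ is s.t. for any policy $\pi'$:

$$J_{\mu}(\hat{\pi}) \le J_{\mu}(\pi') + \epsilon^{\pi'}_{oc} + \frac{c^{\hat{\pi}}_{\nu} + c^{\pi'}_{\nu}}{2} H \epsilon^{L1}_{prd}.$$

This also holds as a function of $\epsilon^{KL}_{prd}$ or $\epsilon^{cls}_{prd}$ using Lem. 3.1.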
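To give a feel for the magnitudes involved in this bound, the sketch below evaluates its terms for a small tabular example. All numbers (10 states, 4 actions, costs in [0, 1], discount 0.9, predictive error 0.01) and the helper names are assumptions chosen for illustration. The learned policy is taken to concentrate on a single state-action pair, the worst case under a uniform exploration distribution.

```python
import numpy as np

def mismatch_coefficient(D, nu):
    """c_nu^pi = sup_{s,a} D(s,a) / nu(s,a); infinite if D leaves nu's support."""
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(D > 0, D / nu, 0.0)
    return float(ratio.max())

def scaling_factor(c_min, c_max, gamma):
    """H = gamma * C_rng / (1 - gamma)^2."""
    return gamma * (c_max - c_min) / (1.0 - gamma) ** 2

def batch_regret_bound(eps_oc, c_learned, c_compared, H, eps_prd_l1):
    """Upper bound on J(pi_hat) - J(pi') given by the Batch analysis."""
    return eps_oc + 0.5 * (c_learned + c_compared) * H * eps_prd_l1

n_states, n_actions = 10, 4
nu = np.full((n_states, n_actions), 1.0 / (n_states * n_actions))  # uniform
D_learned = np.zeros((n_states, n_actions))
D_learned[0, 0] = 1.0               # learned policy stays on one pair
D_compared = nu.copy()              # comparison policy matches exploration

c_learned = mismatch_coefficient(D_learned, nu)    # |S||A| = 40
c_compared = mismatch_coefficient(D_compared, nu)  # 1
H = scaling_factor(0.0, 1.0, gamma=0.9)            # about 90
bound = batch_regret_bound(0.0, c_learned, c_compared, H, eps_prd_l1=0.01)
```

Even with a one percent predictive error, the train-test mismatch factor of the learned policy dominates the bound in this example, which is far larger than the maximum possible total cost of 10. The bound is therefore vacuous here unless the model error is much smaller.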
This bound indicates that if Batch solves the OC problem
well and $\hat{T}$ has small enough error under the training
distribution $\nu$, then it must find a good policy. Importantly,
this bound is tight: i.e. we can construct examples where
it holds with equality (see supplementary material). More
interesting is what happens as we collect more data. If
the fitting procedure is consistent (i.e. picks a model with
minimal loss in the class asymptotically), then we can relate
this guarantee to the capacity of the model class to
achieve low error under the training distribution $\nu$. We
denote the modeling error, measured in $L_1$ distance, as
$\epsilon^{L1}_{mdl} = \inf_{T' \in \mathcal{T}} \mathbb{E}_{(s,a) \sim \nu}[\|T_{sa} - T'_{sa}\|_1]$.
Similarly, define
$\epsilon^{KL}_{mdl} = \inf_{T' \in \mathcal{T}} \mathbb{E}_{(s,a) \sim \nu, s' \sim T_{sa}}[\log(T_{sa}(s')) - \log(T'_{sa}(s'))]$
and
$\epsilon^{cls}_{mdl} = \inf_{T' \in \mathcal{T}} \mathbb{E}_{(s,a) \sim \nu, s' \sim T_{sa}}[\ell(T', s, a, s')]$.
These are all 0 in realizable settings, but generally non-zero
in agnostic settings. After sampling $m$ transitions, the
generalization error $\epsilon^{L1}_{gen}(m, \delta)$ bounds with high probability
$1 - \delta$ the quantity $\epsilon^{L1}_{prd} - \epsilon^{L1}_{mdl}$. Similarly,
$\epsilon^{KL}_{gen}(m, \delta)$ and $\epsilon^{cls}_{gen}(m, \delta)$ denote the generalization error
for the KL and classification loss respectively.
$\epsilon^{cls}_{gen}(m, \delta)$ can be related to the VC dimension (or
multi-class equivalent) in finite MDPs.

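Corollary 3.1. After observing $m$ transitions, with probability
at least $1 - \delta$, for any policy $\pi'$:

$$J_{\mu}(\hat{\pi}) \le J_{\mu}(\pi') + \epsilon^{\pi'}_{oc} + \frac{c^{\hat{\pi}}_{\nu} + c^{\pi'}_{\nu}}{2} H [\epsilon^{L1}_{mdl} + \epsilon^{L1}_{gen}(m, \delta)].$$

This also holds as a function of $\epsilon^{KL}_{mdl} + \epsilon^{KL}_{gen}(m, \delta)$ (or
$\epsilon^{cls}_{mdl} + \epsilon^{cls}_{gen}(m, \delta)$) using Lem. 3.1. In addition, if the fitting
procedure is consistent in terms of $L_1$ distance (or KL,
classification loss), then $\epsilon^{L1}_{gen}(m, \delta) \to 0$ (or $\epsilon^{KL}_{gen}(m, \delta) \to 0$,
$\epsilon^{cls}_{gen}(m, \delta) \to 0$) as $m \to \infty$ for any $\delta > 0$.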

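The generalization error typically scales with the complexity
of the class $\mathcal{T}$ and goes to 0 at a rate of $O(\frac{1}{\sqrt{m}})$
($O(\frac{1}{m})$ in ideal conditions). Given enough samples, the
dominating factor limiting performance becomes the modeling
error: i.e. the term $\frac{c^{\hat{\pi}}_{\nu} + c^{\pi'}_{\nu}}{2} H \epsilon^{L1}_{mdl}$ (or equivalently
$\frac{c^{\hat{\pi}}_{\nu} + c^{\pi'}_{\nu}}{2} H \sqrt{2 \epsilon^{KL}_{mdl}}$ and $(c^{\hat{\pi}}_{\nu} + c^{\pi'}_{\nu}) H \epsilon^{cls}_{mdl}$) quantifies how
performance degrades for agnostic settings.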

Drawback of Batch: The two factors $c^{\hat{\pi}}_{\nu}$ and $c^{\pi'}_{\nu}$ are
qualitatively different. $c^{\pi'}_{\nu}$ measures how well $\nu$ explores
state-actions visited by the policy $\pi'$ we compare to. This factor
is inevitable: we cannot hope to compete against policies
that spend most of their time where we rarely explore. $c^{\hat{\pi}}_{\nu}$
measures the mismatch in train-test distribution. Its presence
is the major drawback of Batch. As $\hat{\pi}$ cannot be
known in advance, we can only bound $c^{\hat{\pi}}_{\nu}$ by considering
all policies we could learn: $\sup_{\pi \in \Pi} c^{\pi}_{\nu}$. This worst case is
likely to be realized in practice: if $\nu$ rarely explores some
state-action regions, the model could be bad for these and
significantly underestimate their cost. The learned policy
is thus encouraged to visit these low-cost regions where
little data was collected. To minimize $\sup_{\pi \in \Pi} c^{\pi}_{\nu}$, the best
$\nu$ for Batch is often a uniform distribution, when possible.
This introduces a dependency on the number of states
and actions (or state-action space volume) (i.e. $c^{\hat{\pi}}_{\nu} + c^{\pi'}_{\nu}$ is
$O(|S||A|)$) multiplying the modeling error. Sampling from
a uniform distribution often requires access to a generative
model. If we only have access to a reset model^3 and a base
policy $\pi_0$ inducing $\nu$ when executed in the system, then $c^{\hat{\pi}}_{\nu}$
could be arbitrarily large (e.g., if $\hat{\pi}$ goes to 0 probability
states under $\pi_0$), and $\hat{\pi}$ arbitrarily worse than $\pi_0$.

In the next section, we show that iterative learning methods
can leverage feedback of the learned policies to obtain
bounds that do not depend on $c^{\hat{\pi}}_{\nu}$. This leads to better
guarantees when we have a good exploration distribution $\nu$
(e.g., that of a near-optimal policy), or when we can only
collect data via a reset model. This also leads to better
performance in practice as shown in the experiments.

4. No-Regret Methods for Agnostic MBRL 

Our extension of DAgger to the MBRL setting proceeds
as follows. Starting from an initial model $\hat{T}^1 \in \mathcal{T}$, solve
(approximately) the OC problem with $\hat{T}^1$ to obtain policy
$\pi_1$. At each iteration $n$, collect data about the system
by sampling state-action pairs from distribution
$\rho_n = \frac{1}{2} \nu + \frac{1}{2} D_{\mu,\pi_n}$: i.e. w.p. $\frac{1}{2}$, sample a transition occurring
from an exploratory state-action pair drawn from $\nu$ and add
it to dataset $\mathcal{D}$, otherwise, sample a state transition occurring
from running the current policy $\pi_n$ starting in $\mu$, stopping
the trajectory w.p. $1 - \gamma$ at each step and adding the
last transition to $\mathcal{D}$. The dataset $\mathcal{D}$ contains all transitions
observed so far over all iterations. Once data is collected,
find the best model $\hat{T}^{n+1} \in \mathcal{T}$ that minimizes an appropriate
loss (e.g. regularized negative log likelihood) on $\mathcal{D}$, and
solve (approximately) the OC problem with $\hat{T}^{n+1}$ to obtain
the next policy $\pi_{n+1}$. This is iterated for $N$ iterations.
At test time, we could either find and use the policy with
lowest expected total cost in the sequence $\pi_{1:N}$, or use the
uniform "mixture" policy^4 over $\pi_{1:N}$. We guarantee good
performance for both. The last policy $\pi_N$ often performs
equally well, as it has been trained with the most data. Our
experimental results confirm this intuition. In theory, $\pi_N$ has
good guarantees when the distributions $D_{\mu,\pi_i}$ converge to
a small region in the space of distributions as $i \to \infty$, but
we do not guarantee this always occurs.

Implementation with Off-the-Shelf Online Learner:
DAgger as described can be interpreted as using a
Follow-The-(Regularized)-Leader (FTRL) online algorithm to pick
the sequence of models: at each iteration $n$ we pick the
best (regularized) model $\hat{T}^n$ in hindsight under all samples
seen so far. In general, DAgger can also be implemented
using any no-regret online algorithm (see Algorithm 1) to
provide good guarantees. This is done as follows. When
minimizing the negative log likelihood, the loss function of
the online learning problem at iteration $i$ is:
$L^{KL}_i(\hat{T}) = \mathbb{E}_{(s,a) \sim \rho_i, s' \sim T_{sa}}[-\log(\hat{T}_{sa}(s'))]$.
This can be estimated from sampled state transitions at
iteration $i$, and evaluated for any model $\hat{T}$. The online
algorithm is applied on the sequence of losses $L^{KL}_{1:N}$ to obtain
a sequence of models $\hat{T}^{1:N}$ over the iterations. As before,
each model $\hat{T}^i$ is solved to obtain the next policy $\pi_i$. By
doing so, the online algorithm effectively runs over
mini-batches of data collected at each iteration to update the
model, and each mini-batch comes from a different distribution
that changes as we update the policy. Similarly, in a finite
MDP with a deterministic model class $\mathcal{T}$, we can minimize
the 0-1 loss instead (or any upper bound such as hinge loss)
where the loss at iteration $i$ is:
$L^{cls}_i(\hat{T}) = \mathbb{E}_{(s,a) \sim \rho_i, s' \sim T_{sa}}[\ell(\hat{T}, s, a, s')]$, for
$\ell$ the particular classification loss. This corresponds to an
online classification problem. For many model classes, the
negative log likelihood and convex upper bounds on the 0-1
loss (such as hinge loss) lead to convex online learning
problems, for which no-regret algorithms exist (e.g., gradient
descent, FTRL). As shown below, if the sequence of
models is no-regret, then performance can be related to the
minimum KL divergence (or classification loss) achievable
with model class $\mathcal{T}$ under the overall training distribution
$\bar{\rho} = \frac{1}{N} \sum_{i=1}^{N} \rho_i$ (i.e. a quantity akin to $\epsilon^{KL}_{mdl}$ or $\epsilon^{cls}_{mdl}$ for
Batch).

^3 To sample transitions with a reset model, we can only simulate
the system forward in time, or reset to a random initial state.
^4 At the start of any trajectory, the mixture policy picks uniformly
at random a policy in $\pi_{1:N}$, and uses it for the whole trajectory.
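One possible off-the-shelf online learner for a continuous system is projected online gradient descent on the mini-batch negative log likelihood of a linear-Gaussian model, which for unit covariance reduces to a squared loss. The sketch below is an assumed, simplified instance: the class name `OnlineGradientModel`, the step size schedule, the projection radius and the synthetic data stream are all hypothetical. For brevity the mini-batches come from one fixed distribution, whereas in DAgger the distribution would change with the policy at each iteration.

```python
import numpy as np

class OnlineGradientModel:
    """Linear-Gaussian model x' ~ N(W z, I) with z = [x; u], trained by
    projected online gradient descent on the mini-batch negative log
    likelihood (a squared loss, up to an additive constant)."""

    def __init__(self, dim_z, dim_x, radius=10.0, eta=0.5):
        self.W = np.zeros((dim_x, dim_z))
        self.radius, self.eta, self.t = radius, eta, 0

    def loss(self, Z, X_next):
        """Average negative log likelihood, up to an additive constant."""
        resid = X_next - Z @ self.W.T
        return 0.5 * np.mean(np.sum(resid ** 2, axis=1))

    def update(self, Z, X_next):
        """One gradient step on the loss of the current mini-batch."""
        self.t += 1
        resid = X_next - Z @ self.W.T
        grad = -resid.T @ Z / len(Z)
        self.W -= self.eta / np.sqrt(self.t) * grad
        norm = np.linalg.norm(self.W)
        if norm > self.radius:                  # project back onto the ball
            self.W *= self.radius / norm

# Hypothetical data stream from a fixed linear system.
rng = np.random.default_rng(0)
W_true = rng.uniform(-1.0, 1.0, size=(2, 3))
model = OnlineGradientModel(dim_z=3, dim_x=2)
losses = []
for i in range(200):
    Z = rng.normal(size=(50, 3))                # mini-batch of iteration i
    X_next = Z @ W_true.T + 0.1 * rng.normal(size=(50, 2))
    losses.append(model.loss(Z, X_next))        # loss suffered before updating
    model.update(Z, X_next)
```

The decreasing step size proportional to $1/\sqrt{t}$ and the projection onto a bounded set are the standard ingredients that give online gradient descent sublinear regret on convex losses.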

Algorithm 1 DAgger algorithm for Agnostic MBRL.
  Input: exploration distribution $\nu$, number of iterations
  $N$, number of samples per iteration $m$, cost function
  $C$, online learning procedure ONLINELEARNER, optimal
  control procedure OCSOLVER.

  Get initial guess of model: $\hat{T}^1 \leftarrow$ ONLINELEARNER().
  $\pi_1 \leftarrow$ OCSOLVER($\hat{T}^1, C$).
  for $n = 2$ to $N$ do
    for $k = 1$ to $m$ do
      With prob. $\frac{1}{2}$ sample $(s, a) \sim D_{\mu,\pi_{n-1}}$ using $\pi_{n-1}$,
      otherwise sample $(s, a) \sim \nu$.
      Obtain $s' \sim T_{sa}$.
      Add $(s, a, s')$ to $\mathcal{D}_{n-1}$.
    end for
    Update model: $\hat{T}^n \leftarrow$ ONLINELEARNER($\mathcal{D}_{n-1}$).
    $\pi_n \leftarrow$ OCSOLVER($\hat{T}^n, C$).
  end for
  Return the sequence of policies $\pi_{1:N}$.
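The following is a minimal runnable sketch of this loop on a scalar continuous system, under several stated assumptions that are not from the paper. The "real" system `real_step` is a hypothetical mildly non-linear map, so the linear model class is agnostic. The online learner is Follow-the-Regularized-Leader, implemented as ridge regression on all data so far. The OC solver is a scalar discounted LQR. Both $\mu$ and the exploration distribution $\nu$ are standard Gaussians.

```python
import numpy as np

GAMMA = 0.9

def real_step(x, u, rng):
    """Hypothetical real system; the sin term puts it outside the linear class."""
    return 0.9 * x + 0.5 * u + 0.1 * np.sin(x) + 0.01 * rng.normal()

def oc_solver(theta, q=1.0, r=0.1, iters=500):
    """Scalar discounted LQR on the model x' = theta[0] * x + theta[1] * u."""
    a, b = theta
    v = q
    for _ in range(iters):
        k = -GAMMA * a * b * v / (r + GAMMA * b * b * v)
        v = q + r * k * k + GAMMA * (a + b * k) ** 2 * v
    return k                                    # policy u = k * x

def sample_from_policy(k, rng):
    """Draw (x, u, x') from D_{mu,pi}: stop w.p. 1 - gamma at each step."""
    x = rng.normal()                            # x_1 ~ mu = N(0, 1)
    while True:
        u = k * x
        x_next = real_step(x, u, rng)
        if rng.random() < 1.0 - GAMMA:
            return x, u, x_next
        x = x_next

def sample_from_nu(rng):
    """Draw (x, u, x') with (x, u) from the exploration distribution nu."""
    x, u = rng.normal(size=2)
    return x, u, real_step(x, u, rng)

def online_learner(data, ridge=1e-3):
    """Follow-the-regularized-leader: ridge regression on all data so far."""
    if not data:
        return np.array([1.0, 1.0])             # arbitrary initial guess
    D = np.asarray(data)
    Z, y = D[:, :2], D[:, 2]
    return np.linalg.solve(Z.T @ Z + ridge * np.eye(2), Z.T @ y)

def dagger(n_iters, m, rng):
    data, policies = [], []
    theta = online_learner(data)
    policies.append(oc_solver(theta))
    for n in range(2, n_iters + 1):
        for k in range(m):
            if rng.random() < 0.5:              # data from the current policy
                data.append(sample_from_policy(policies[-1], rng))
            else:                               # data from exploration
                data.append(sample_from_nu(rng))
        theta = online_learner(data)
        policies.append(oc_solver(theta))
    return policies, theta

policies, theta = dagger(n_iters=20, m=50, rng=np.random.default_rng(0))
```

Because the policy data alone lies on the line $u = kx$, the exploration half of the mixture is what keeps the regression well conditioned in this example.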



4.1. Analysis 

Similar to our analysis of Batch, we seek to answer the
following: if there exists a low error model of training data,
and we solve each OC problem well, what guarantees does
DAgger provide on control performance? Our results show
that by sampling data from the learned policies, DAgger
provides guarantees that have no train-test mismatch factor,
leading to improved performance.

For any policy $\pi'$, define
$\epsilon^{\pi'}_{oc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{s \sim \mu}[\hat{V}^i(s) - \hat{V}^{\pi'}_i(s)]$,
where $\hat{V}^i$ and $\hat{V}^{\pi'}_i$ are respectively the value function
of $\pi_i$ and $\pi'$ under model $\hat{T}^i$. This measures how well
we solved each OC problem on average over the iterations.
For instance, if at each iteration $i$ we found an $\epsilon_i$-optimal
policy within some class of policies $\Pi$ on learned model $\hat{T}^i$,
then $\epsilon^{\pi'}_{oc} \le \frac{1}{N} \sum_{i=1}^{N} \epsilon_i$ for all $\pi' \in \Pi$. As in Batch, the
average predictive error of the models $\hat{T}^{1:N}$ can be measured
in terms of the $L_1$ distance between the predicted and true
next state distribution:
$\epsilon^{L1}_{prd} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{(s,a) \sim \rho_i}[\|\hat{T}^i_{sa} - T_{sa}\|_1]$.
However, as was discussed, the $L_1$ distance is not
observed from samples, which makes it hard to minimize.
Instead we can define other measures which upper bound
this $L_1$ distance and can be minimized from samples, such
as the KL divergence or classification loss: i.e.
$\epsilon^{KL}_{prd} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{(s,a) \sim \rho_i, s' \sim T_{sa}}[\log(T_{sa}(s')) - \log(\hat{T}^i_{sa}(s'))]$
and
$\epsilon^{cls}_{prd} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{(s,a) \sim \rho_i, s' \sim T_{sa}}[\ell(\hat{T}^i, s, a, s')]$.
Now, given the sequence of policies $\pi_{1:N}$, let
$\hat{\pi} = \arg\min_{\pi \in \pi_{1:N}} J_{\mu}(\pi)$ be the best policy in the sequence and
$\bar{\pi}$ the uniform mixture policy on the sequence.

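Lemma 4.1. The policies $\pi_{1:N}$ are s.t. for any policy $\pi'$:

$$J_{\mu}(\hat{\pi}) \le J_{\mu}(\bar{\pi}) \le J_{\mu}(\pi') + \epsilon^{\pi'}_{oc} + c^{\pi'}_{\nu} H \epsilon^{L1}_{prd}.$$

This also holds as a function of $\epsilon^{KL}_{prd}$ or $\epsilon^{cls}_{prd}$ using Lem. 3.1.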



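We note that
$\epsilon^{KL}_{prd} = \frac{1}{N} \sum_{i=1}^{N} [L^{KL}_i(\hat{T}^i) - L^{KL}_i(T)]$ and
$\epsilon^{cls}_{prd} = \frac{1}{N} \sum_{i=1}^{N} L^{cls}_i(\hat{T}^i)$. Using a no-regret algorithm on
the sequence of losses $L^{KL}_{1:N}$ implies
$\frac{1}{N} \sum_{i=1}^{N} L^{KL}_i(\hat{T}^i) \le \inf_{T' \in \mathcal{T}} \frac{1}{N} \sum_{i=1}^{N} L^{KL}_i(T') + \epsilon^{KL}_{rgt}$,
for $\epsilon^{KL}_{rgt}$ the average regret of the algorithm after $N$
iterations, s.t. $\epsilon^{KL}_{rgt} \to 0$ as $N \to \infty$. This relates $\epsilon^{KL}_{prd}$ to
the modeling error of the class $\mathcal{T}$:
$\epsilon^{KL}_{mdl} = \inf_{T' \in \mathcal{T}} \mathbb{E}_{(s,a) \sim \bar{\rho}, s' \sim T_{sa}}[\log(T_{sa}(s')) - \log(T'_{sa}(s'))]$,
i.e. $\epsilon^{KL}_{prd} \le \epsilon^{KL}_{mdl} + \epsilon^{KL}_{rgt}$, for $\epsilon^{KL}_{rgt} \to 0$. Similarly
define
$\epsilon^{cls}_{mdl} = \inf_{T' \in \mathcal{T}} \mathbb{E}_{(s,a) \sim \bar{\rho}, s' \sim T_{sa}}[\ell(T', s, a, s')]$
and by using a no-regret algorithm on $L^{cls}_{1:N}$,
$\epsilon^{cls}_{prd} \le \epsilon^{cls}_{mdl} + \epsilon^{cls}_{rgt}$ for $\epsilon^{cls}_{rgt} \to 0$. In some cases, even if
the $L_1$ distance cannot be estimated from samples, statistical
estimators can still be no-regret with high probability on the
sequence of losses
$L^{L1}_i(\hat{T}) = \mathbb{E}_{(s,a) \sim \rho_i}[\|T_{sa} - \hat{T}_{sa}\|_1]$.
This is the case in finite MDPs if we use the empirical
estimator of $T$ based on data seen so far (see supplementary
material). If we define
$\epsilon^{L1}_{mdl} = \inf_{T' \in \mathcal{T}} \mathbb{E}_{(s,a) \sim \bar{\rho}}[\|T_{sa} - T'_{sa}\|_1]$,
this implies that $\epsilon^{L1}_{prd} \le \epsilon^{L1}_{mdl} + \epsilon^{L1}_{rgt}$, for $\epsilon^{L1}_{rgt} \to 0$.
Our main result follows: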



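Theorem 4.1. The policies $\pi_{1:N}$ are s.t. for any policy $\pi'$:

$$J_{\mu}(\hat{\pi}) \le J_{\mu}(\bar{\pi}) \le J_{\mu}(\pi') + \epsilon^{\pi'}_{oc} + c^{\pi'}_{\nu} H [\epsilon^{L1}_{mdl} + \epsilon^{L1}_{rgt}].$$

This also holds as a function of $\epsilon^{KL}_{mdl} + \epsilon^{KL}_{rgt}$ (or $\epsilon^{cls}_{mdl} + \epsilon^{cls}_{rgt}$)
using Lem. 3.1. If the fitting procedure is no-regret w.r.t. the
sequence of losses $L^{L1}_{1:N}$ (or $L^{KL}_{1:N}$, $L^{cls}_{1:N}$), then $\epsilon^{L1}_{rgt} \to 0$ (or
$\epsilon^{KL}_{rgt} \to 0$, $\epsilon^{cls}_{rgt} \to 0$) as $N \to \infty$.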

Additionally, the performance of $\pi_N$ can be related to $\bar{\pi}$ if
the distributions $D_{\mu,\pi_i}$ converge to a small region:

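Lemma 4.2. If there exists a distribution $D^*$ and some
$\epsilon^*_{cnv} \ge 0$ s.t. $\forall i$, $\|D_{\mu,\pi_i} - D^*\|_1 \le \epsilon^*_{cnv} + \epsilon^i_{cnv}$ for some
sequence $\{\epsilon^i_{cnv}\}_{i=1}^{\infty}$ that is $o(1)$, then $\pi_N$ is s.t.:

$$J_{\mu}(\pi_N) \le J_{\mu}(\bar{\pi}) + \frac{C_{rng}}{2(1-\gamma)} \left[ 2 \epsilon^*_{cnv} + \epsilon^N_{cnv} + \frac{1}{N} \sum_{i=1}^{N} \epsilon^i_{cnv} \right].$$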
Thm. 4.1 illustrates how we can reduce the original MBRL
problem to a no-regret online learning problem on a particular
sequence of loss functions. In general, no-regret
algorithms have average regret of $O(\frac{1}{\sqrt{N}})$ ($O(\frac{1}{N})$ in ideal
cases) such that the regret term goes to 0 at a similar
rate to the generalization error term for Batch in Cor. 3.1.
Here, given enough iterations, the term $c^{\pi'}_{\nu} H \epsilon^{L1}_{mdl}$
determines how performance degrades in the agnostic setting (or
$c^{\pi'}_{\nu} H \sqrt{2 \epsilon^{KL}_{mdl}}$ or $2 c^{\pi'}_{\nu} H \epsilon^{cls}_{mdl}$ if we use a no-regret algorithm
on the sequence of KL or classification loss respectively).
Unlike for Batch, there is no dependence on $c^{\hat{\pi}}_{\nu}$, only on
$c^{\pi'}_{\nu}$. Thus, if a low error model exists under training
distribution $\bar{\rho}$, no-regret methods are guaranteed to learn policies
that perform well compared to any policy $\pi'$ for which $c^{\pi'}_{\nu}$
is small. Hence, $\nu$ is ideally $D_{\mu,\pi}$ of a near-optimal policy
$\pi$ (i.e. explore where good policies go).

Finite Sample Analysis: A remaining issue is that the current
guarantees apply if we can evaluate the expected loss
($L^{L1}_i$, $L^{KL}_i$ or $L^{cls}_i$) exactly. This requires infinite samples
at each iteration. If we run the no-regret algorithm on
estimates of these loss functions, i.e. loss on $m$ sampled
transitions, we can still obtain good guarantees using martingale
inequalities as in online-to-batch (Cesa-Bianchi et al.,
2004) techniques. The extra generalization error term is
typically $O(\sqrt{\frac{\log(1/\delta)}{Nm}})$ with high probability $1 - \delta$. While
our focus is not on providing such finite sample bounds, we
illustrate how these can be derived for two scenarios in the
supplementary material. For instance, in finite MDPs with
$|S|$ states and $|A|$ actions, if $\hat{T}^i$ is the empirical estimator
of $T$ based on samples collected in the first $i - 1$ iterations,
then choosing $m = 1$ and $N$ in $\tilde{O}(\frac{C_{rng}^2 |S|^2 |A| \log(1/\delta)}{\epsilon^2 (1-\gamma)^4})$
guarantees that w.p. $1 - \delta$, for any policy $\pi'$:

$$J_{\mu}(\hat{\pi}) \le J_{\mu}(\bar{\pi}) \le J_{\mu}(\pi') + \epsilon^{\pi'}_{oc} + O(c^{\pi'}_{\nu} \epsilon).$$

Here, $\epsilon_{mdl}$ does not appear as it is 0 (realizable case). Given
a good state-action distribution $\nu$, the sample complexity to
get a near-optimal policy is $\tilde{O}(\frac{C_{rng}^2 |S|^2 |A| \log(1/\delta)}{\epsilon^2 (1-\gamma)^4})$. This
improves upon other state-of-the-art MBRL algorithms, such
as $R_{max}$, $\tilde{O}(\frac{C_{rng}^3 |S|^2 |A| \log(1/\delta)}{\epsilon^3 (1-\gamma)^6})$ (Strehl et al., 2009) and a
recent modification of $R_{max}$, $\tilde{O}(\frac{C_{rng}^2 |S| |A| \log(1/\delta)}{\epsilon^2 (1-\gamma)^6})$ (Szita
& Szepesvari, 2010) (when $|S| < \frac{1}{(1-\gamma)^2}$). Here, the
dependency on $|S|^2 |A|$ is due to the complexity of the class
($|S|^2 |A|$ parameters). With simpler classes, it can have no
dependency on the size of the MDP. In the supplementary
material, we analyze a scenario where $\mathcal{T}$ is a set of kernel
SVMs (deterministic models) with RKHS norm bounded by
$K$. Choosing $m = 1$ and $N$ in $O(\frac{C_{rng}^2 (K^2 + \log(1/\delta))}{\epsilon^2 (1-\gamma)^4})$
guarantees that w.p. $1 - \delta$, for any policy $\pi'$:

$$J_{\mu}(\hat{\pi}) \le J_{\mu}(\bar{\pi}) \le J_{\mu}(\pi') + \epsilon^{\pi'}_{oc} + 2 c^{\pi'}_{\nu} H \hat{\epsilon}^{cls}_{mdl} + O(c^{\pi'}_{\nu} \epsilon),$$

for $\hat{\epsilon}^{cls}_{mdl}$ the multi-class hinge loss on the training set after
$N$ iterations of the best SVM in hindsight. Thus, if we
have a good exploration distribution and there exists a good
model in $\mathcal{T}$ for predicting observed data, we obtain a
near-optimal policy with sample complexity that depends only
on the complexity of $\mathcal{T}$, not the size of the MDP.

5. Discussion 

We emphasize that we provide reduction-style guarantees. 
DAgger may sometimes fail to find good policies, e.g., 
when no model in the class achieves low error on the training
data. However, DAgger guarantees that one of the following
occurs: either (1) we find good policies or (2) no
models with low error on the aggregate dataset exist. If
the latter occurs, we need a better model class. In contrast,
Batch can find models with low training error, but still fail
at obtaining a policy with good control performance, due 
to train/test mismatch. This occurs even in scenarios where 
DAgger finds good policies, as shown in the experiments. 

DAgger needs to solve many OC problems. This can be 
computationally expensive, e.g., with non-linear or high- 
dimensional models. Many approximate methods can be 
used, e.g., policy gradient (Williams, 1992), fitted value it- 
eration (Szepesvari, 2005) or iLQR (Li & Todorov, 2004). 
As the models often change only slightly from one itera- 
tion to the next, we can often run only a few iterations of 
dynamic programming/policy gradient from the last value 
function/policy to obtain a good policy for the current 
model. As long as we get good solutions on average, $\epsilon^{\pi'}_{oc}$
remains small and does not hinder performance.

DAgger generalizes the approach of Atkeson & Schaal 
(1997) and Abbeel & Ng (2005) so that we can use any no- 
regret algorithm to update the model, as well as any explo- 
ration distribution. A key difference is that DAgger keeps 
an even balance between exploration data and data from 
running the learned policies. This is crucial to avoid settling
on suboptimal performance in agnostic settings as the
exploration data could be ignored if it occupies only a small 
fraction of the dataset, in favor of models with lower error 
on the data from the learned policies. With this modifica- 
tion, our main contribution is showing that such methods 
have good guarantees even in agnostic settings. 
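
As a rough illustration of the even balance described above, the following sketch runs a DAgger-style loop on a toy scalar system. Everything in it is an assumption made for illustration: the system (`true_step`), the uniform exploration distribution, the deadbeat controller standing in for the OC step, and least squares on the aggregate dataset standing in for the no-regret learner. It is not the implementation used in the experiments.

```python
import numpy as np

def true_step(x, u, rng):
    # Hypothetical "real system": a noisy scalar linear system.
    return 0.9 * x + 0.5 * u + 0.01 * rng.standard_normal()

def fit_model(data):
    # Least-squares fit of x' ~ a*x + b*u on the aggregate dataset.
    X = np.array([[x, u] for x, u, _ in data])
    y = np.array([xn for _, _, xn in data])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta  # (a_hat, b_hat)

def synthesize_policy(theta):
    # Stand-in for the OC step: gain that drives the model's next state to 0.
    a_hat, b_hat = theta
    k = a_hat / b_hat if abs(b_hat) > 1e-6 else 0.0
    return lambda x: -k * x

def sample_exploration(m, rng):
    # Exploration distribution: states and actions drawn uniformly.
    out = []
    for _ in range(m):
        x, u = rng.uniform(-1, 1, size=2)
        out.append((x, u, true_step(x, u, rng)))
    return out

def sample_on_policy(policy, m, rng):
    # Transitions visited when running the current learned policy.
    out, x = [], rng.uniform(-1, 1)
    for _ in range(m):
        u = policy(x)
        xn = true_step(x, u, rng)
        out.append((x, u, xn))
        x = xn
    return out

def dagger_sysid(n_iters=10, m=20, seed=0):
    rng = np.random.default_rng(seed)
    policy = lambda x: 0.0  # initial policy before any model is fit
    data = []
    for _ in range(n_iters):
        # Even split: half exploration data, half data from the current policy.
        data += sample_exploration(m // 2, rng)
        data += sample_on_policy(policy, m - m // 2, rng)
        theta = fit_model(data)            # refit on ALL data so far
        policy = synthesize_policy(theta)  # re-solve control for the new model
    return theta, policy, data
```

Because half of every batch comes from the exploration distribution, the exploration data never shrinks to a negligible fraction of the aggregate dataset, which is the property the paragraph above relies on.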

6. Experiments on Helicopter Domain 

We demonstrate the efficacy of DAgger on a challenging 
problem: learning to perform acrobatic maneuvers with a 
simulated helicopter, using the simulator of Abbeel & Ng 
(2005), which has a continuous 21-dimensional state and
4-dimensional control space. We consider learning to 1)
hover and 2) perform a "nose-in funnel" maneuver. We
compare DAgger to Batch with several choices for $\nu$: 1) $\nu_t$:
adding small white Gaussian noise^5 to each state and action
along the desired trajectory, 2) $\nu_e$: run an expert controller,
and 3) $\nu_{en}$: run the expert controller with additional white
Gaussian noise^6 in the controls of the expert. The expert
controller is obtained by linearizing the true model about 
the desired trajectory and solving the LQR (iLQR for the 
nose-in funnel). We also compare against Abbeel's algo- 
rithm, where the expert is only used at the first iteration. 
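
For readers unfamiliar with the LQR step mentioned above, here is a minimal sketch of a finite-horizon, discrete-time LQR solved by the backward Riccati recursion. It assumes a time-invariant linear model and quadratic cost matrices `Q` and `R` chosen by the user; the function names and the double-integrator example in the usage note are illustrative assumptions, not the helicopter model.

```python
import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    # Backward Riccati recursion for x_{t+1} = A x_t + B u_t with
    # stage cost x'Qx + u'Ru and terminal cost x'Qx.
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # gains[t] is applied at time t: u_t = -K_t x_t

def simulate(A, B, gains, x0):
    # Roll out the noiseless closed loop and return the final state.
    x = np.asarray(x0, dtype=float)
    for K in gains:
        x = A @ x - B @ (K @ x)
    return x
```

For example, with a double integrator `A = [[1, 0.1], [0, 1]]`, `B = [[0], [0.1]]`, `Q = I` and `R = 0.1 I`, the gains returned for a horizon of 200 steps drive an initial state of `[1, 0]` close to the origin.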

Hover: All approaches begin with an initial model 

$\Delta x_{t+1} = A \Delta x_t + B \Delta u_t$, for $\Delta x_t$ the difference between
the current and hover state at time $t$, $\Delta u_t$ the delta controls
at time $t$, $A$ the identity, and $B$ the matrix that adds the delta controls
to the actual controls in $\Delta x_t$. We seek to learn offset matrices
$A'$, $B'$ that minimize $\|\Delta x_{t+1} - [(A + A')\Delta x_t + (B + B')\Delta u_t]\|_2$
on observed data^7. We attempt to learn
to hover in the presence of noise^8 and delays of 0 and 1. A
delay of 1 introduces high-order dynamics that cannot be 
modeled with the current state. All methods sample 100 
transitions per iteration and run for: 50 iterations when de- 
lay is 0; 100 iterations when delay is 1. Figure 2 shows 
the test performance of each method after each iteration. 
In both cases, for any choice of $\nu$, DAgger outperforms
Batch significantly and converges to a good policy faster.
DAgger is more robust to the choice of $\nu$, as it always obtains
good performance given enough iterations, whereas
Batch obtains good performance with only one choice of

^5 Covariance of $0.0025 I$ for states and $0.0001 I$ for actions.

^6 Covariance of $0.0001 I$.

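^7 We also use a Frobenius norm regularizer on $A'$ and $B'$:
$\min_{A',B'} \frac{1}{n}\sum_{i=1}^{n} \|\Delta x'_i - [(A + A')\Delta x_i + (B + B')\Delta u_i]\|_2 + \frac{\lambda}{\sqrt{n}}(\|A'\|_F^2 + \|B'\|_F^2)$,
for $\lambda = 10^{-3}$, $n$ the number of samples
and $(\Delta x_i, \Delta u_i, \Delta x'_i)$ the $i$-th transition in the dataset. During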
training we stop a trajectory if it becomes too far from the hover
state, i.e. if $\|[\Delta x; \Delta u]\|_2 > 5$, as this represents an event that
would have to be recovered from. During testing, we run the
trajectory until completion (400 timesteps of 0.05s, 20s total).

^8 White Gaussian noise with covariance $I$ on the forces and
torques applied to the helicopter at each step.



$\nu$ in each case. Also, DAgger eventually learns a policy
that outperforms the expert policy (L). As the expert pol- 
icy is inevitably visiting states far from the hover state due 
to the large noise and delay (unknown to the expert), the 
linearized model is not as good at those states, leading to 
slightly suboptimal performance. Thus DAgger is learning 
a better linear model for the states visited by the learned 
policy which leads to better performance. Abbeel's algorithm
improves the initial policy but reaches a plateau. This
is due to lack of exploration (expert demonstrations) after
the first iteration. While our objective is to show that DAgger
outperforms other model-based approaches, we also
compared against a model-free policy gradient method similar
to CPI^9. However, 100 samples per iteration were insufficient
to get good gradient estimates and led to only small
improvement. Even with 500 samples per iteration, it could
only reach an avg. total cost of $\sim$15000 after 100 iterations.
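
The offset-matrix fit used for the hover model can be sketched as a regularized least-squares problem. The sketch below is an assumption-laden simplification: it swaps in a squared loss with a fixed ridge weight `lam` so that a closed-form solution exists, which is not necessarily the exact objective or scaling used in the experiments, and the function name `fit_offsets` is hypothetical.

```python
import numpy as np

def fit_offsets(dX, dU, dX_next, A, B, lam=1e-3):
    # Rows of dX, dU, dX_next are transitions (delta state, delta control, next delta state).
    # Fits A_off, B_off so that dX_next ~ (A + A_off) dX + (B + B_off) dU.
    dx = dX.shape[1]
    Z = np.hstack([dX, dU])                 # (n, dx + du) regressors
    R = dX_next - dX @ A.T - dU @ B.T       # residual the offsets must explain
    n, d = Z.shape
    # Ridge solution of min_W (1/n) ||Z W - R||_F^2 + lam ||W||_F^2 (squared loss assumed).
    W = np.linalg.solve(Z.T @ Z / n + lam * np.eye(d), Z.T @ R / n)
    A_off = W[:dx].T
    B_off = W[dx:].T
    return A_off, B_off
```

In a DAgger loop, `dX`, `dU` and `dX_next` would hold all transitions aggregated so far, and the fit would be repeated after each new batch is added.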

Nose-In Funnel: This maneuver consists in rotating at 
a fixed speed and distance around an axis normal to the 
ground with the helicopter's nose pointing towards the axis 
of rotation (desired trajectory in Fig. 1). We attempt to 
learn to perform 4 complete rotations of radius 5 in the
presence of noise^10 but no delay. We start each approach
with a linearized model about the hover state and learn a
time-varying linear model^11. All methods collect 500 samples
per iteration over 100 iterations. Figure 2 (bottom)
shows the test performance after each iteration. With the 
initial model (0 data), the controller fails to produce the 
maneuver and performance is quite poor. Again, with any
choice of $\nu$, DAgger outperforms Batch, and unlike Batch,
it performs well with all choices of $\nu$. A video comparing
qualitatively the learned maneuver with DAgger and Batch 
is available on YouTube (Ross, 2012). Abbeel's method 
improves performance slightly but again suffers from lack 
of expert demonstrations after the first iteration. 

7. Conclusion 

We presented a no-regret online learning approach to 
MBRL that has strong performance, both in theory and 
practice, even in agnostic settings. It is simple to imple- 
ment, formalizes and makes algorithmic the engineering 
practice of iterating between controller synthesis and sys- 
tem identification, and can be applied to any control prob- 
lem where approximately solving the OC problem is feasi- 
ble. Additionally, its sample complexity scales with model 



^9 Same as CPI, except gradient descent is done directly on a deterministic
linear controller. We solve a linear system to estimate
the gradient from sample cost with perturbed parameters.

^10 Zero-mean spherical Gaussian with standard deviation 0.1 on
the forces and torques applied to the helicopter at each step.

^11 For each time step $t$, we learn offset matrices $A'_t$, $B'_t$ such
that $x_{t+1} = (A + A'_t)\Delta x_t + (B + B'_t)\Delta u_t + x^*_{t+1}$, for
$x^*_t$ the desired state at time $t$ and $A$, $B$ the given hover model.

Figure 2. Average total cost on test trajectories as a function of
data collected so far, averaged over 20 repetitions of the experiments,
each starting with a different random seed (all approaches
use the same 20 seeds). From top to bottom: hover with no delay,
hover with delay of 1, nose-in funnel. $D_t$, $D_e$ and $D_{en}$ denote
DAgger using exploration distributions $\nu_t$, $\nu_e$ and $\nu_{en}$ respectively,
similarly $B_t$, $B_e$ and $B_{en}$ for the Batch algorithm, $A$ for Abbeel's
algorithm, and $L$ for the linearized model's optimal controller.



class complexity, not the size of the MDP. To our knowl- 
edge, this is the first practical MBRL algorithm with agnos- 
tic guarantees. The only other agnostic MBRL approach 
we are aware of is a recent agnostic extension of Rmax
(Szita & Szepesvari, 2011) that is largely theoretical: it requires
unknown quantities to run the algorithm (e.g., distance
between the real system and the model class) and its
sample complexity is exponential in the class complexity.